AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.
The objectives are to predict whether a liability customer will buy a personal loan, to identify which variables are most significant, and to determine which segment of customers should be targeted more.
# this will help in making the Python code more structured automatically (good coding practice)
# %load_ext nb_black
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Library to split data
from sklearn.model_selection import train_test_split
# libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# To build model for prediction
from sklearn.linear_model import LogisticRegression
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
plot_confusion_matrix,
precision_recall_curve,
roc_curve,
)
loan_df = pd.read_csv("Loan_Modelling.csv")
# copying data to another variable to avoid any changes to original data
data = loan_df.copy()
data.head()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
data.tail()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
data.shape
(5000, 14)
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5000 non-null int64 1 Age 5000 non-null int64 2 Experience 5000 non-null int64 3 Income 5000 non-null int64 4 ZIPCode 5000 non-null int64 5 Family 5000 non-null int64 6 CCAvg 5000 non-null float64 7 Education 5000 non-null int64 8 Mortgage 5000 non-null int64 9 Personal_Loan 5000 non-null int64 10 Securities_Account 5000 non-null int64 11 CD_Account 5000 non-null int64 12 Online 5000 non-null int64 13 CreditCard 5000 non-null int64 dtypes: float64(1), int64(13) memory usage: 547.0 KB
print(data.columns)
Index(['ID', 'Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg',
'Education', 'Mortgage', 'Personal_Loan', 'Securities_Account',
'CD_Account', 'Online', 'CreditCard'],
dtype='object')
data.isnull().sum().sort_values(ascending=False)
ID 0 Age 0 Experience 0 Income 0 ZIPCode 0 Family 0 CCAvg 0 Education 0 Mortgage 0 Personal_Loan 0 Securities_Account 0 CD_Account 0 Online 0 CreditCard 0 dtype: int64
# checking for duplicate values in the data
data.duplicated().sum()
0
# let's check the summary of our data
data.describe(include="all").T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.0 | 2500.500000 | 1443.520003 | 1.0 | 1250.75 | 2500.5 | 3750.25 | 5000.0 |
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.00 | 45.0 | 55.00 | 67.0 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.0 | 10.00 | 20.0 | 30.00 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.00 | 64.0 | 98.00 | 224.0 |
| ZIPCode | 5000.0 | 93169.257000 | 1759.455086 | 90005.0 | 91911.00 | 93437.0 | 94608.00 | 96651.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.00 | 2.0 | 3.00 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.70 | 1.5 | 2.50 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.00 | 2.0 | 3.00 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.00 | 0.0 | 101.00 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
Let us look at the summary statistics and the levels of the categorical variables.
Age: the average customer age is 45 years, with a range from 23 to 67 years.
Experience: the average professional experience is about 20 years, but the minimum is -3; negative experience is impossible and will need to be treated.
Income: the average annual income is about 74 (in $000s) with a maximum of 224; the gap between the 75th percentile (98) and the maximum suggests outliers on the right.
CCAvg: average credit card spending is about 1.9 (in $000s) with a maximum of 10, again indicating right-skew.
Mortgage: at least half of the customers have no mortgage, while the maximum is 635, so the variable is heavily right-skewed.
Personal_Loan: only 9.6% of customers accepted the personal loan in the last campaign, so the classes are imbalanced.
category_cols = ['Personal_Loan','Securities_Account','CD_Account','Online','CreditCard','Education']
def category_convert(category_cols):
for colname in category_cols:
data[colname] = data[colname].astype('category')
category_convert(category_cols)
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5000 non-null int64 1 Age 5000 non-null int64 2 Experience 5000 non-null int64 3 Income 5000 non-null int64 4 ZIPCode 5000 non-null int64 5 Family 5000 non-null int64 6 CCAvg 5000 non-null float64 7 Education 5000 non-null category 8 Mortgage 5000 non-null int64 9 Personal_Loan 5000 non-null category 10 Securities_Account 5000 non-null category 11 CD_Account 5000 non-null category 12 Online 5000 non-null category 13 CreditCard 5000 non-null category dtypes: category(6), float64(1), int64(7) memory usage: 342.7 KB
for i in category_cols:
print("Unique values in", i, "are :")
print(data[i].value_counts())
print("*" * 50)
Unique values in Personal_Loan are : 0 4520 1 480 Name: Personal_Loan, dtype: int64 ************************************************** Unique values in Securities_Account are : 0 4478 1 522 Name: Securities_Account, dtype: int64 ************************************************** Unique values in CD_Account are : 0 4698 1 302 Name: CD_Account, dtype: int64 ************************************************** Unique values in Online are : 1 2984 0 2016 Name: Online, dtype: int64 ************************************************** Unique values in CreditCard are : 0 3530 1 1470 Name: CreditCard, dtype: int64 ************************************************** Unique values in Education are : 1 2096 3 1501 2 1403 Name: Education, dtype: int64 **************************************************
data.Family.unique()
array([4, 3, 1, 2])
data.Education.unique()
[1, 2, 3] Categories (3, int64): [1, 2, 3]
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
    # histogram, using the requested number of bins when one is given
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
histogram_boxplot(data, "Income", bins=100)
histogram_boxplot(data, "Age", bins=50)
histogram_boxplot(data, "Income", bins=50)
histogram_boxplot(data, "Mortgage", bins=50)
numerical_col = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(20, 30))
for i, variable in enumerate(numerical_col):
plt.subplot(5, 4, i + 1)
plt.boxplot(data[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
# functions to treat outliers by flooring and capping
def treat_outliers(df, col):
"""
Treats outliers in a variable
df: dataframe
col: dataframe column
"""
Q1 = df[col].quantile(0.25) # 25th quantile
Q3 = df[col].quantile(0.75) # 75th quantile
IQR = Q3 - Q1
Lower_Whisker = Q1 - 1.5 * IQR
Upper_Whisker = Q3 + 1.5 * IQR
# all the values smaller than Lower_Whisker will be assigned the value of Lower_Whisker
# all the values greater than Upper_Whisker will be assigned the value of Upper_Whisker
df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
return df
def treat_outliers_all(df, col_list):
"""
Treat outliers in a list of variables
df: dataframe
col_list: list of dataframe columns
"""
for c in col_list:
df = treat_outliers(df, c)
return df
numerical_col = data.select_dtypes(include=np.number).columns.tolist()
data = treat_outliers_all(data, numerical_col)
# let's look at box plot to see if outliers have been treated or not
plt.figure(figsize=(20, 30))
for i, variable in enumerate(numerical_col):
plt.subplot(5, 4, i + 1)
plt.boxplot(data[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
        if perc:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
labeled_barplot(data, "Education", perc=True)
labeled_barplot(data, "Family", perc=True)
labeled_barplot(data, "Personal_Loan", perc=True)
labeled_barplot(data, "Personal_Loan", perc=True)
labeled_barplot(data, "Securities_Account", perc=True)
labeled_barplot(data, "Online", perc=True)
labeled_barplot(data, "CreditCard", perc=True)
labeled_barplot(data, "CD_Account", perc=True)
labeled_barplot(data, "Education", perc=True)
plt.figure(figsize=(15, 7))
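# note: data.corr() uses only the numeric columns; the category-typed flags are excluded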
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
# 'Personal_Loan','Securities_Account','CD_Account','Online','CreditCard','Education'
stacked_barplot(data, "Education", "Personal_Loan")
stacked_barplot(data, "Securities_Account", "Personal_Loan")
stacked_barplot(data, "CD_Account", "Personal_Loan")
stacked_barplot(data, "Online", "Personal_Loan")
stacked_barplot(data, "CreditCard", "Personal_Loan")
stacked_barplot(data, "Family", "Personal_Loan")
Personal_Loan 0 1 All Education All 4520 480 5000 3 1296 205 1501 2 1221 182 1403 1 2003 93 2096 ------------------------------------------------------------------------------------------------------------------------
Personal_Loan 0 1 All Securities_Account All 4520 480 5000 0 4058 420 4478 1 462 60 522 ------------------------------------------------------------------------------------------------------------------------
Personal_Loan 0 1 All CD_Account All 4520 480 5000 0 4358 340 4698 1 162 140 302 ------------------------------------------------------------------------------------------------------------------------
Personal_Loan 0 1 All Online All 4520 480 5000 1 2693 291 2984 0 1827 189 2016 ------------------------------------------------------------------------------------------------------------------------
Personal_Loan 0 1 All CreditCard All 4520 480 5000 0 3193 337 3530 1 1327 143 1470 ------------------------------------------------------------------------------------------------------------------------
Personal_Loan 0 1 All Family All 4520 480 5000 4 1088 134 1222 3 877 133 1010 1 1365 107 1472 2 1190 106 1296 ------------------------------------------------------------------------------------------------------------------------
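The conversion rates implied by these crosstabs can be read off directly; for example, by education level (a quick sketch using the same crosstab as above):
# share of loan purchasers (class 1) within each education level
print(pd.crosstab(data["Education"], data["Personal_Loan"], normalize="index")[1])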
sns.pairplot(loan_df, hue="Personal_Loan")
plt.show()
Data Description:
There are no missing values or duplicate rows in the dataset.
Data Cleaning:
Experience contains impossible negative values (minimum -3) that need to be treated, and the numeric columns were capped at the IQR whiskers to treat outliers.
Observations from EDA:
Education: the loan conversion rate rises with education level, from roughly 4% for undergraduates (Education = 1) to about 13-14% for graduate and advanced/professional customers.
CD_Account: close to half of the customers who hold a certificate of deposit account purchased a loan, versus roughly 7% of those who do not.
Family: households of 3 or 4 members convert at a somewhat higher rate than households of 1 or 2.
Securities_Account, Online, CreditCard: conversion rates in these groups stay close to the overall 9.6%, so on their own these flags look weakly related to the target.
# function to plot distributions with respect to the target
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
stat="density",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
stat="density",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5000 non-null int64 1 Age 5000 non-null int64 2 Experience 5000 non-null int64 3 Income 5000 non-null float64 4 ZIPCode 5000 non-null int64 5 Family 5000 non-null int64 6 CCAvg 5000 non-null float64 7 Education 5000 non-null category 8 Mortgage 5000 non-null float64 9 Personal_Loan 5000 non-null category 10 Securities_Account 5000 non-null category 11 CD_Account 5000 non-null category 12 Online 5000 non-null category 13 CreditCard 5000 non-null category dtypes: category(6), float64(3), int64(5) memory usage: 342.7 KB
distribution_plot_wrt_target(data, "Income", "Personal_Loan")
distribution_plot_wrt_target(data, "Age", "Personal_Loan")
Observations from the distribution plots:
Income: customers who purchased a personal loan tend to have markedly higher incomes than those who did not, making Income a strong separator of the two classes.
Age: the age distributions of purchasers and non-purchasers look very similar, so Age by itself is unlikely to separate the classes.
ID is only a row identifier with no predictive value, so we drop it before modeling.
data=data.drop(["ID"], axis=1)
data.columns
Index(['Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg',
'Education', 'Mortgage', 'Personal_Loan', 'Securities_Account',
'CD_Account', 'Online', 'CreditCard'],
dtype='object')
data["Personal_Loan"]=data["Personal_Loan"].astype('int64')
data["Personal_Loan"].unique()
array([0, 1])
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Age 5000 non-null int64 1 Experience 5000 non-null int64 2 Income 5000 non-null float64 3 ZIPCode 5000 non-null int64 4 Family 5000 non-null int64 5 CCAvg 5000 non-null float64 6 Education 5000 non-null category 7 Mortgage 5000 non-null float64 8 Personal_Loan 5000 non-null int64 9 Securities_Account 5000 non-null category 10 CD_Account 5000 non-null category 11 Online 5000 non-null category 12 CreditCard 5000 non-null category dtypes: category(5), float64(3), int64(5) memory usage: 337.7 KB
Creating training and test sets.
X = data.drop(["Personal_Loan"], axis=1)
Y = data["Personal_Loan"]
X = pd.get_dummies(X, drop_first=True)
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set : (3500, 13) Shape of test set : (1500, 13) Percentage of classes in training set: 0 0.905429 1 0.094571 Name: Personal_Loan, dtype: float64 Percentage of classes in test set: 0 0.900667 1 0.099333 Name: Personal_Loan, dtype: float64
Both kinds of error are important here:
If we predict a customer will purchase the loan but they do not (a false positive), the bank spends marketing budget on a customer who never converts.
If we predict a customer will not purchase the loan but they actually would (a false negative), the bank loses a potential loan customer and the interest income that comes with them.
Since both errors are costly, the F1 score should be maximized; F1 is the harmonic mean of precision and recall (F1 = 2 * Precision * Recall / (Precision + Recall)), so the higher the F1 score, the better the model identifies both classes correctly.
# defining a function to compute different metrics to check the performance of a classification model built using sklearn
def model_performance_classification_sklearn_with_threshold(model, predictors, target, threshold=0.5):
"""
Function to compute different metrics, based on the threshold specified, to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
    # predicting the probability of class 1 using the independent variables
    pred_prob = model.predict_proba(predictors)[:, 1]
    pred = pred_prob > threshold  # classify as 1 when the predicted probability exceeds the threshold
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1,
},
index=[0],
)
return df_perf
# defining a function to plot the confusion_matrix of a classification model built using sklearn
def confusion_matrix_sklearn_with_threshold(model, predictors, target, threshold=0.5):
"""
To plot the confusion_matrix, based on the threshold specified, with percentages
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
    pred_prob = model.predict_proba(predictors)[:, 1]
    y_pred = pred_prob > threshold  # classify as 1 when the predicted probability exceeds the threshold
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
# There are different solvers available in sklearn's logistic regression;
# we use newton-cg here, which works well on small, dense datasets like this one
lg = LogisticRegression(solver="newton-cg", random_state=1)
model = lg.fit(X_train, y_train)
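As a sanity check, the fitted model's probabilities can be reproduced by hand from its coefficients; a minimal sketch of the logistic (sigmoid) link, assuming the lg model fitted above:
# p = 1 / (1 + exp(-(intercept + X @ coef))) is exactly what predict_proba returns for class 1
z = lg.intercept_ + X_train.values @ lg.coef_[0]
manual_prob = 1 / (1 + np.exp(-z))
print(np.allclose(manual_prob, lg.predict_proba(X_train)[:, 1]))  # True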
# let us check the coefficients and intercept of the model
coef_df = pd.DataFrame(
np.append(lg.coef_, lg.intercept_),
index=X_train.columns.tolist() + ["Intercept"],
columns=["Coefficients"],
)
coef_df.T
| Age | Experience | Income | ZIPCode | Family | CCAvg | Mortgage | Education_2 | Education_3 | Securities_Account_1 | CD_Account_1 | Online_1 | CreditCard_1 | Intercept | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Coefficients | -0.054155 | 0.060643 | 0.055701 | -0.000084 | 0.676851 | 0.42196 | 0.001326 | 3.41879 | 3.527992 | -0.723174 | 3.027986 | -0.573594 | -0.814464 | -4.088152 |
Odds from coefficients
# converting coefficients to odds
odds = np.exp(lg.coef_[0])
# finding the percentage change
perc_change_odds = (np.exp(lg.coef_[0]) - 1) * 100
# removing limit from number of columns to display
pd.set_option("display.max_columns", None)
# adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train.columns).T
| Age | Experience | Income | ZIPCode | Family | CCAvg | Mortgage | Education_2 | Education_3 | Securities_Account_1 | CD_Account_1 | Online_1 | CreditCard_1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Odds | 0.947285 | 1.062519 | 1.057282 | 0.999916 | 1.967671 | 1.524948 | 1.001327 | 30.532459 | 34.055525 | 0.485210 | 20.655590 | 0.563497 | 0.442877 |
| Change_odd% | -5.271480 | 6.251902 | 5.728170 | -0.008376 | 96.767098 | 52.494766 | 0.132729 | 2953.245863 | 3305.552540 | -51.479006 | 1965.558983 | -43.650332 | -55.712315 |
Income: holding all other features constant, a 1 unit increase in Income increases the odds of purchasing a personal loan by about 1.06 times, i.e. a 5.7% increase in the odds.
Family: holding all other features constant, a 1 unit increase in family size increases the odds of purchasing a loan by about 1.97 times, a 96.8% increase in the odds.
Education_2 / Education_3: customers with graduate or advanced/professional education have roughly 30.5 and 34.1 times the odds of purchasing a loan compared with undergraduates.
CD_Account_1: holding all else constant, customers with a certificate of deposit account have about 20.7 times the odds of purchasing a loan.
Securities_Account_1, Online_1, CreditCard_1: each of these flags is associated with lower odds of purchasing a loan (decreases of roughly 51%, 44%, and 56% respectively).
Interpretation for the remaining attributes can be done similarly.
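As a quick check, the odds figures above can be reproduced from a single coefficient; a sketch using the Income entry of coef_df:
# odds ratio for Income: exp(beta), and the implied % change in odds per unit increase
beta_income = coef_df.loc["Income", "Coefficients"]
print(np.exp(beta_income))               # ~1.057, matching the Odds row
print((np.exp(beta_income) - 1) * 100)   # ~5.73% increase in odds per unit of Income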
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(lg, X_train, y_train)
log_reg_model_train_perf = model_performance_classification_sklearn_with_threshold(
lg, X_train, y_train
)
print("Training performance:")
log_reg_model_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.961714 | 0.682779 | 0.886275 | 0.771331 |
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(lg, X_test, y_test)
log_reg_model_test_perf = model_performance_classification_sklearn_with_threshold(
lg, X_test, y_test
)
print("Test set performance:")
log_reg_model_test_perf
Test set performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.958 | 0.657718 | 0.890909 | 0.756757 |
logit_roc_auc_train = roc_auc_score(y_train, lg.predict_proba(X_train)[:, 1])
fpr, tpr, thresholds = roc_curve(y_train, lg.predict_proba(X_train)[:, 1])
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
logit_roc_auc_test = roc_auc_score(y_test, lg.predict_proba(X_test)[:, 1])
fpr, tpr, thresholds = roc_curve(y_test, lg.predict_proba(X_test)[:, 1])
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg.predict_proba(X_train)[:, 1])
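# tpr - fpr is Youden's J statistic; maximizing it picks the threshold that best balances sensitivity and specificity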
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.09417281086602405
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
lg, X_train, y_train, threshold=optimal_threshold_auc_roc
)
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_sklearn_with_threshold(
lg, X_train, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.901714 | 0.912387 | 0.489465 | 0.637131 |
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
lg, X_test, y_test, threshold=optimal_threshold_auc_roc
)
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_sklearn_with_threshold(
lg, X_test, y_test, threshold=optimal_threshold_auc_roc
)
print("Test set performance:")
log_reg_model_test_perf_threshold_auc_roc
Test set performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.906 | 0.892617 | 0.515504 | 0.653563 |
y_scores = lg.predict_proba(X_train)[:, 1]
prec, rec, tre = precision_recall_curve(y_train, y_scores,)
def plot_prec_recall_vs_thresh(precisions, recalls, thresholds):
plt.plot(thresholds, precisions[:-1], "b--", label="precision")
plt.plot(thresholds, recalls[:-1], "g--", label="recall")
plt.xlabel("Threshold")
plt.legend(loc="upper left")
plt.ylim([0, 1])
plt.figure(figsize=(10, 7))
plot_prec_recall_vs_thresh(prec, rec, tre)
plt.show()
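The 0.35 threshold below is read off the plot by eye; a programmatic alternative, sketched here with the prec/rec/tre arrays computed above, is to take the threshold that maximizes F1 along the curve (the small epsilon guards against division by zero):
# threshold maximizing F1 along the precision-recall curve
f1_scores = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-12)
print(tre[np.argmax(f1_scores)])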
# setting the threshold
optimal_threshold_curve = 0.35
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
lg, X_train, y_train, threshold=optimal_threshold_curve
)
log_reg_model_train_perf_threshold_curve = model_performance_classification_sklearn_with_threshold(
lg, X_train, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.959429 | 0.776435 | 0.790769 | 0.783537 |
# creating confusion matrix
confusion_matrix_sklearn_with_threshold(
lg, X_test, y_test, threshold=optimal_threshold_curve
)
log_reg_model_test_perf_threshold_curve = model_performance_classification_sklearn_with_threshold(
lg, X_test, y_test, threshold=optimal_threshold_curve
)
print("Test set performance:")
log_reg_model_test_perf_threshold_curve
Test set performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.956 | 0.738255 | 0.80292 | 0.769231 |
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T
],
axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression (0.5 threshold)",
    "Logistic Regression (0.09 threshold, AUC-ROC)",
    "Logistic Regression (0.35 threshold, P-R curve)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Logistic Regression (0.5 threshold) | Logistic Regression (0.09 threshold, AUC-ROC) | Logistic Regression (0.35 threshold, P-R curve) | |
|---|---|---|---|
| Accuracy | 0.961714 | 0.901714 | 0.959429 |
| Recall | 0.682779 | 0.912387 | 0.776435 |
| Precision | 0.886275 | 0.489465 | 0.790769 |
| F1 | 0.771331 | 0.637131 | 0.783537 |
# testing performance comparison
models_test_comp_df = pd.concat(
[
log_reg_model_test_perf.T,
log_reg_model_test_perf_threshold_auc_roc.T,
log_reg_model_test_perf_threshold_curve.T
],
axis=1,
)
models_test_comp_df.columns = [
    "Logistic Regression (0.5 threshold)",
    "Logistic Regression (0.09 threshold, AUC-ROC)",
    "Logistic Regression (0.35 threshold, P-R curve)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
| Logistic Regression (0.5 threshold) | Logistic Regression (0.09 threshold, AUC-ROC) | Logistic Regression (0.35 threshold, P-R curve) | |
|---|---|---|---|
| Accuracy | 0.958000 | 0.906000 | 0.956000 |
| Recall | 0.657718 | 0.892617 | 0.738255 |
| Precision | 0.890909 | 0.515504 | 0.802920 |
| F1 | 0.756757 | 0.653563 | 0.769231 |
Why should we do feature selection?
Dropping uninformative features reduces model complexity, mitigates overfitting, and makes the coefficients easier to interpret, often at little or no cost in predictive performance.
# Sequential feature selector is present in mlxtend library
# !pip install mlxtend  # installs the mlxtend library
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
# to plot the performance with addition of each feature
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
# from sklearn.linear_model import LogisticRegression
# Fit the model on train
model = LogisticRegression(solver="newton-cg", n_jobs=-1, random_state=1, max_iter=100)
X.shape
(5000, 13)
# we will first build a model with all variables
sfs = SFS(
model,
k_features=13,
forward=True,
floating=False,
scoring="f1",
verbose=2,
cv=3,
n_jobs=-1,
)
sfs = sfs.fit(X_train, y_train)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 13 out of 13 | elapsed: 2.1s remaining: 0.0s [Parallel(n_jobs=-1)]: Done 13 out of 13 | elapsed: 2.1s finished [2021-12-17 23:11:00] Features: 1/13 -- score: 0.41343889085824576[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 12 out of 12 | elapsed: 0.9s remaining: 0.0s [Parallel(n_jobs=-1)]: Done 12 out of 12 | elapsed: 0.9s finished [2021-12-17 23:11:01] Features: 2/13 -- score: 0.5343065333269479[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 11 out of 11 | elapsed: 1.0s finished [2021-12-17 23:11:02] Features: 3/13 -- score: 0.588710257767378[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 1.5s finished [2021-12-17 23:11:03] Features: 4/13 -- score: 0.7247668393782384[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 7 out of 9 | elapsed: 1.2s remaining: 0.3s [Parallel(n_jobs=-1)]: Done 9 out of 9 | elapsed: 1.4s finished [2021-12-17 23:11:05] Features: 5/13 -- score: 0.754399957882545[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 6 out of 8 | elapsed: 0.8s remaining: 0.3s [Parallel(n_jobs=-1)]: Done 8 out of 8 | elapsed: 1.1s finished [2021-12-17 23:11:06] Features: 6/13 -- score: 0.7617423782550928[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 0.8s remaining: 0.6s [Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 1.2s finished [2021-12-17 23:11:07] Features: 7/13 -- score: 0.7673208908661803[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 3 out of 6 | elapsed: 0.4s remaining: 0.4s [Parallel(n_jobs=-1)]: Done 6 out of 6 | elapsed: 1.0s finished [2021-12-17 23:11:08] Features: 8/13 -- score: 0.7670844795354087[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 5 out of 5 | elapsed: 1.1s finished [2021-12-17 23:11:09] Features: 9/13 -- score: 0.767626850280264[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 4 out of 4 | elapsed: 0.9s finished [2021-12-17 23:11:10] Features: 10/13 -- score: 0.7655549555325173[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 3 out of 3 | elapsed: 0.9s finished [2021-12-17 23:11:11] Features: 11/13 -- score: 0.7616776053393486[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 1.3s finished [2021-12-17 23:11:12] Features: 12/13 -- score: 0.7583900382472043[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 1 out of 1 | elapsed: 0.9s finished [2021-12-17 23:11:13] Features: 13/13 -- score: 0.7527401837928154
fig1 = plot_sfs(sfs.get_metric_dict(), kind="std_dev", figsize=(12, 5))
plt.ylim([0.3, 0.9])  # the CV F1 scores range from ~0.41 to ~0.77, so this window shows the whole curve
plt.title("Sequential Forward Selection (w. StdDev)")
plt.xticks(rotation=90)
plt.show()
sfs1 = SFS(
model,
k_features=8,
forward=True,
floating=False,
scoring="f1",
verbose=2,
cv=3,
n_jobs=-1,
)
sfs1 = sfs1.fit(X_train, y_train)
fig1 = plot_sfs(sfs1.get_metric_dict(), kind="std_dev", figsize=(10, 5))
plt.ylim([0.3, 0.9])  # the CV F1 scores range from ~0.41 to ~0.77, so this window shows the whole curve
plt.title("Sequential Forward Selection (w. StdDev)")
plt.grid()
plt.show()
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 13 out of 13 | elapsed: 0.5s remaining: 0.0s [Parallel(n_jobs=-1)]: Done 13 out of 13 | elapsed: 0.5s finished [2021-12-17 23:13:23] Features: 1/8 -- score: 0.41343889085824576[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 12 out of 12 | elapsed: 1.2s remaining: 0.0s [Parallel(n_jobs=-1)]: Done 12 out of 12 | elapsed: 1.2s finished [2021-12-17 23:13:24] Features: 2/8 -- score: 0.5343065333269479[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 11 out of 11 | elapsed: 1.0s finished [2021-12-17 23:13:25] Features: 3/8 -- score: 0.588710257767378[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 1.4s finished [2021-12-17 23:13:26] Features: 4/8 -- score: 0.7247668393782384[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 7 out of 9 | elapsed: 1.5s remaining: 0.4s [Parallel(n_jobs=-1)]: Done 9 out of 9 | elapsed: 1.8s finished [2021-12-17 23:13:28] Features: 5/8 -- score: 0.754399957882545[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 6 out of 8 | elapsed: 1.1s remaining: 0.4s [Parallel(n_jobs=-1)]: Done 8 out of 8 | elapsed: 1.8s finished [2021-12-17 23:13:30] Features: 6/8 -- score: 0.7617423782550928[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 0.9s remaining: 0.7s [Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 1.3s finished [2021-12-17 23:13:31] Features: 7/8 -- score: 0.7673208908661803[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 3 out of 6 | elapsed: 0.5s remaining: 0.5s [Parallel(n_jobs=-1)]: Done 6 out of 6 | elapsed: 1.1s finished [2021-12-17 23:13:32] Features: 8/8 -- score: 0.7670844795354087
Finding which features were selected:
feat_cols = list(sfs1.k_feature_idx_)
print(feat_cols)
[2, 4, 6, 7, 8, 9, 10, 11]
Let's look at the best 8 variables.
X_train.columns[feat_cols]
Index(['Income', 'Family', 'Mortgage', 'Education_2', 'Education_3',
'Securities_Account_1', 'CD_Account_1', 'Online_1'],
dtype='object')
X_train_final = X_train[X_train.columns[feat_cols]]
# Creating new x_test with the same variables that we selected for x_train
X_test_final = X_test[X_train_final.columns]
# Fitting a logistic regression model on the selected features
logreg = LogisticRegression(
    solver="newton-cg", penalty="none", verbose=True, n_jobs=-1, random_state=0
)
# There are several optimizers; here we use 'newton-cg' with penalty="none", i.e. an unregularized fit
logreg.fit(X_train_final, y_train)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 1 out of 1 | elapsed: 0.1s finished
LogisticRegression(n_jobs=-1, penalty='none', random_state=0,
solver='newton-cg', verbose=True)
confusion_matrix_sklearn_with_threshold(logreg, X_train_final, y_train)
log_reg_model_train_perf_SFS = model_performance_classification_sklearn_with_threshold(
logreg, X_train_final, y_train
)
print("Training performance:")
log_reg_model_train_perf_SFS
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.963143 | 0.691843 | 0.894531 | 0.780239 |
confusion_matrix_sklearn_with_threshold(logreg, X_test_final, y_test)
log_reg_model_test_perf_SFS = model_performance_classification_sklearn_with_threshold(
logreg, X_test_final, y_test
)
print("Test set performance:")
log_reg_model_test_perf_SFS
Test set performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.958 | 0.651007 | 0.898148 | 0.754864 |
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
log_reg_model_train_perf_SFS.T,
],
axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression (0.5 threshold)",
    "Logistic Regression (0.09 threshold, AUC-ROC)",
    "Logistic Regression (0.35 threshold, P-R curve)",
    "Logistic Regression (SFS features)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Logistic Regression (0.5 threshold) | Logistic Regression (0.09 threshold, AUC-ROC) | Logistic Regression (0.35 threshold, P-R curve) | Logistic Regression (SFS features) | |
|---|---|---|---|---|
| Accuracy | 0.961714 | 0.901714 | 0.959429 | 0.963143 |
| Recall | 0.682779 | 0.912387 | 0.776435 | 0.691843 |
| Precision | 0.886275 | 0.489465 | 0.790769 | 0.894531 |
| F1 | 0.771331 | 0.637131 | 0.783537 | 0.780239 |
# testing performance comparison
models_test_comp_df = pd.concat(
    [
        log_reg_model_test_perf.T,
        log_reg_model_test_perf_threshold_auc_roc.T,
        log_reg_model_test_perf_threshold_curve.T,
        log_reg_model_test_perf_SFS.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Logistic Regression (0.5 threshold)",
    "Logistic Regression (0.09 threshold, AUC-ROC)",
    "Logistic Regression (0.35 threshold, P-R curve)",
    "Logistic Regression (SFS features)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
| Logistic Regression (0.5 threshold) | Logistic Regression (0.09 threshold, AUC-ROC) | Logistic Regression (0.35 threshold, P-R curve) | Logistic Regression (SFS features) | |
|---|---|---|---|---|
| Accuracy | 0.958000 | 0.906000 | 0.956000 | 0.958000 |
| Recall | 0.657718 | 0.892617 | 0.738255 | 0.651007 |
| Precision | 0.890909 | 0.515504 | 0.802920 | 0.898148 |
| F1 | 0.756757 | 0.653563 | 0.769231 | 0.754864 |
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
loan_data=data.copy()
loan_data.head(10)
| Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49.0 | 91107 | 4 | 1.6 | 1 | 0.0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 45 | 19 | 34.0 | 90089 | 3 | 1.5 | 1 | 0.0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 39 | 15 | 11.0 | 94720 | 1 | 1.0 | 1 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 35 | 9 | 100.0 | 94112 | 1 | 2.7 | 2 | 0.0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35 | 8 | 45.0 | 91330 | 4 | 1.0 | 2 | 0.0 | 0 | 0 | 0 | 0 | 1 |
| 5 | 37 | 13 | 29.0 | 92121 | 4 | 0.4 | 2 | 155.0 | 0 | 0 | 0 | 1 | 0 |
| 6 | 53 | 27 | 72.0 | 91711 | 2 | 1.5 | 2 | 0.0 | 0 | 0 | 0 | 1 | 0 |
| 7 | 50 | 24 | 22.0 | 93943 | 1 | 0.3 | 3 | 0.0 | 0 | 0 | 0 | 0 | 1 |
| 8 | 35 | 10 | 81.0 | 90089 | 3 | 0.6 | 2 | 104.0 | 0 | 0 | 0 | 1 | 0 |
| 9 | 34 | 9 | 180.0 | 93023 | 1 | 5.2 | 3 | 0.0 | 1 | 0 | 0 | 0 | 0 |
print(loan_data.Experience.value_counts())
print(loan_data.Age.value_counts())
print(loan_data.Income.value_counts())
print(loan_data.ZIPCode.value_counts())
print(loan_data.Family.value_counts())
print(loan_data.CCAvg.value_counts())
print(loan_data.Education.value_counts())
print(loan_data.Mortgage.value_counts())
print(loan_data.Personal_Loan.value_counts())
print(loan_data.Securities_Account.value_counts())
print(loan_data.CD_Account.value_counts())
print(loan_data.Online.value_counts())
print(loan_data.CreditCard.value_counts())
Experience: values range from 0 to 43 (most frequent: 32, with 154 customers)
Age: values range from 24 to 67 (most frequent: 35, with 151 customers)
Income: values range from 8.0 to 186.5 after capping; the capped value 186.5 is the most frequent (96 customers)
ZIPCode: 467 unique values (most frequent: 94720, with 164 customers)
Family:
1    1470
2    1274
4    1203
3    1001
Name: Family, dtype: int64
CCAvg: values range from 0.0 to 5.2 after capping; the capped value 5.20 is the most frequent (335 customers)
Education:
1    2080
3    1481
2    1387
Name: Education, dtype: int64
Mortgage: 3422 customers have no mortgage (value 0.0); the capped value 252.5 accounts for 288 customers
Personal_Loan:
0    4468
1     480
Name: Personal_Loan, dtype: int64
Securities_Account:
0    4432
1     516
Name: Securities_Account, dtype: int64
CD_Account:
0    4646
1     302
Name: CD_Account, dtype: int64
Online:
1    2954
0    1994
Name: Online, dtype: int64
CreditCard:
0    3493
1    1455
Name: CreditCard, dtype: int64
loan_data = data[data["Experience"] >= 0]
loan_data.count().sum()
loan_data.shape
(4948, 13)
loan_data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 4948 entries, 0 to 4999 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Age 4948 non-null int64 1 Experience 4948 non-null int64 2 Income 4948 non-null float64 3 ZIPCode 4948 non-null int64 4 Family 4948 non-null int64 5 CCAvg 4948 non-null float64 6 Education 4948 non-null category 7 Mortgage 4948 non-null float64 8 Personal_Loan 4948 non-null int64 9 Securities_Account 4948 non-null category 10 CD_Account 4948 non-null category 11 Online 4948 non-null category 12 CreditCard 4948 non-null category dtypes: category(5), float64(3), int64(5) memory usage: 372.7 KB
oneHotCols=["Securities_Account","CD_Account","Online","CreditCard","Education"]
loan_data=pd.get_dummies(loan_data, columns=oneHotCols)
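Note that drop_first is not used here, so each binary flag produces two complementary dummy columns; trees are unaffected by this redundancy, but a leaner encoding is possible:
# optional leaner encoding (sketch): keep one dummy per binary flag
# loan_data = pd.get_dummies(loan_data, columns=oneHotCols, drop_first=True)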
loan_data.head()
| Age | Experience | Income | ZIPCode | Family | CCAvg | Mortgage | Personal_Loan | Securities_Account_0 | Securities_Account_1 | CD_Account_0 | CD_Account_1 | Online_0 | Online_1 | CreditCard_0 | CreditCard_1 | Education_1 | Education_2 | Education_3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49.0 | 91107 | 4 | 1.6 | 0.0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 45 | 19 | 34.0 | 90089 | 3 | 1.5 | 0.0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 39 | 15 | 11.0 | 94720 | 1 | 1.0 | 0.0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 3 | 35 | 9 | 100.0 | 94112 | 1 | 2.7 | 0.0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 35 | 8 | 45.0 | 91330 | 4 | 1.0 | 0.0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
loan_data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 4948 entries, 0 to 4999 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Age 4948 non-null int64 1 Experience 4948 non-null int64 2 Income 4948 non-null float64 3 ZIPCode 4948 non-null int64 4 Family 4948 non-null int64 5 CCAvg 4948 non-null float64 6 Mortgage 4948 non-null float64 7 Personal_Loan 4948 non-null int64 8 Securities_Account_0 4948 non-null uint8 9 Securities_Account_1 4948 non-null uint8 10 CD_Account_0 4948 non-null uint8 11 CD_Account_1 4948 non-null uint8 12 Online_0 4948 non-null uint8 13 Online_1 4948 non-null uint8 14 CreditCard_0 4948 non-null uint8 15 CreditCard_1 4948 non-null uint8 16 Education_1 4948 non-null uint8 17 Education_2 4948 non-null uint8 18 Education_3 4948 non-null uint8 dtypes: float64(3), int64(5), uint8(11) memory usage: 401.1 KB
X = loan_data.drop("Personal_Loan", axis=1)
y = loan_data["Personal_Loan"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=1)
print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in test data =", X_test.shape[0])
Number of rows in train data = 3463 Number of rows in test data = 1485
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Percentage of classes in training set: 0 0.903263 1 0.096737 Name: Personal_Loan, dtype: float64 Percentage of classes in test set: 0 0.902357 1 0.097643 Name: Personal_Loan, dtype: float64
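The class proportions above happen to match closely between the two sets; a stratified split, sketched below as an optional alternative, would guarantee identical Personal_Loan ratios:
# optional: stratify=y keeps the class ratio identical across train and test
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.30, random_state=1, stratify=y
# )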
We will build our model using the DecisionTreeClassifier function, with the default 'gini' criterion to split.
If the frequency of class A is 10% and the frequency of class B is 90%, then class B becomes the dominant class and the decision tree risks becoming biased toward it.
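For reference, the Gini impurity of a node with class proportions p_i is 1 - sum(p_i^2); a minimal sketch:
# Gini impurity: 0 for a pure node, 0.5 at the two-class maximum
def gini(proportions):
    return 1 - sum(p ** 2 for p in proportions)

print(gini([0.10, 0.90]))  # 0.18 for the imbalanced node described above
print(gini([0.50, 0.50]))  # 0.50 for a perfectly mixed two-class node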
dTree = DecisionTreeClassifier(criterion = 'gini', random_state=1)
dTree.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
## Function to calculate recall score
def decision_get_recall_score(model):
'''
model : classifier to predict values of X
'''
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)
print("Recall on training set : ",metrics.recall_score(y_train,pred_train))
print("Recall on test set : ",metrics.recall_score(y_test,pred_test))
## Function to create confusion matrix
def decision_make_confusion_matrix(model, y_actual):
    '''
    model : classifier to predict values of X
    y_actual : ground truth
    '''
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=[0, 1])
    df_cm = pd.DataFrame(cm, index=["Actual - No", "Actual - Yes"],
                         columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in
cm.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
cm.flatten()/np.sum(cm)]
labels = [f"{v1}\n{v2}" for v1, v2 in
zip(group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=labels,fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')
decision_make_confusion_matrix(dTree,y_test)
# Recall on train and test
decision_get_recall_score(dTree)
Recall on training set : 1.0 Recall on test set : 0.8827586206896552
feature_names = list(X.columns)
print(feature_names)
['Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg', 'Mortgage', 'Securities_Account_0', 'Securities_Account_1', 'CD_Account_0', 'CD_Account_1', 'Online_0', 'Online_1', 'CreditCard_0', 'CreditCard_1', 'Education_1', 'Education_2', 'Education_3']
plt.figure(figsize=(20,30))
tree.plot_tree(dTree,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(dTree,feature_names=feature_names,show_weights=True))
|--- Income <= 113.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2497.00, 0.00] class: 0
|   |   |--- Income > 106.50
|   |   |   |--- CCAvg <= 1.95
|   |   |   |   |--- CCAvg <= 1.85
|   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |--- Education_1 <= 0.50
|   |   |   |   |   |   |   |--- Age <= 29.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age > 29.50
|   |   |   |   |   |   |   |   |--- Mortgage <= 231.00
|   |   |   |   |   |   |   |   |   |--- CCAvg <= 1.15
|   |   |   |   |   |   |   |   |   |   |--- ZIPCode <= 92901.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- ZIPCode > 92901.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |--- CCAvg > 1.15
|   |   |   |   |   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Mortgage > 231.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Education_1 > 0.50
|   |   |   |   |   |   |   |--- weights: [19.00, 0.00] class: 0
|   |   |   |   |   |--- Family > 3.50
|   |   |   |   |   |   |--- Education_1 <= 0.50
|   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |--- Education_1 > 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |--- CCAvg > 1.85
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- CCAvg > 1.95
|   |   |   |   |--- weights: [24.00, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- CD_Account_0 <= 0.50
|   |   |   |--- Age <= 39.50
|   |   |   |   |--- Age <= 32.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age > 32.50
|   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |--- Age > 39.50
|   |   |   |   |--- ZIPCode <= 94555.50
|   |   |   |   |   |--- weights: [0.00, 13.00] class: 1
|   |   |   |   |--- ZIPCode > 94555.50
|   |   |   |   |   |--- Family <= 2.00
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- Family > 2.00
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |--- CD_Account_0 > 0.50
|   |   |   |--- Income <= 94.50
|   |   |   |   |--- Age <= 26.50
|   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |--- Age > 26.50
|   |   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |   |--- Mortgage <= 216.50
|   |   |   |   |   |   |   |--- ZIPCode <= 91257.00
|   |   |   |   |   |   |   |   |--- ZIPCode <= 91159.50
|   |   |   |   |   |   |   |   |   |--- weights: [11.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- ZIPCode > 91159.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- ZIPCode > 91257.00
|   |   |   |   |   |   |   |   |--- weights: [55.00, 0.00] class: 0
|   |   |   |   |   |   |--- Mortgage > 216.50
|   |   |   |   |   |   |   |--- Income <= 68.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- Income > 68.00
|   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |--- Income > 81.50
|   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |--- Income <= 83.50
|   |   |   |   |   |   |   |   |--- Experience <= 20.00
|   |   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Experience > 20.00
|   |   |   |   |   |   |   |   |   |--- ZIPCode <= 94323.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |   |--- ZIPCode > 94323.50
|   |   |   |   |   |   |   |   |   |   |--- Online_0 <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- Online_0 > 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Income > 83.50
|   |   |   |   |   |   |   |   |--- Experience <= 3.50
|   |   |   |   |   |   |   |   |   |--- Online_1 <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Online_1 > 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Experience > 3.50
|   |   |   |   |   |   |   |   |   |--- weights: [34.00, 0.00] class: 0
|   |   |   |   |   |   |--- Education_2 > 0.50
|   |   |   |   |   |   |   |--- CCAvg <= 3.50
|   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- CCAvg > 3.50
|   |   |   |   |   |   |   |   |--- Age <= 44.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age > 44.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |--- Income > 94.50
|   |   |   |   |--- Education_1 <= 0.50
|   |   |   |   |   |--- ZIPCode <= 95092.00
|   |   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |   |--- Experience <= 28.50
|   |   |   |   |   |   |   |   |   |--- Experience <= 3.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Experience > 3.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Experience > 28.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |   |--- Education_2 > 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |--- Family > 2.50
|   |   |   |   |   |   |   |--- CCAvg <= 3.45
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- CCAvg > 3.45
|   |   |   |   |   |   |   |   |--- weights: [0.00, 13.00] class: 1
|   |   |   |   |   |--- ZIPCode > 95092.00
|   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |--- Education_1 > 0.50
|   |   |   |   |   |--- ZIPCode <= 90167.00
|   |   |   |   |   |   |--- ZIPCode <= 90029.00
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode > 90029.00
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |--- ZIPCode > 90167.00
|   |   |   |   |   |   |--- ZIPCode <= 95164.00
|   |   |   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |   |   |--- weights: [22.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Family > 2.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 4.20
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- CCAvg > 4.20
|   |   |   |   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode > 95164.00
|   |   |   |   |   |   |   |--- Experience <= 14.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Experience > 14.00
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|--- Income > 113.50
|   |--- Education_1 <= 0.50
|   |   |--- Income <= 116.50
|   |   |   |--- CCAvg <= 2.15
|   |   |   |   |--- CCAvg <= 0.65
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- CCAvg > 0.65
|   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |--- CCAvg > 2.15
|   |   |   |   |--- Mortgage <= 126.25
|   |   |   |   |   |--- CreditCard_1 <= 0.50
|   |   |   |   |   |   |--- weights: [0.00, 7.00] class: 1
|   |   |   |   |   |--- CreditCard_1 > 0.50
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- Mortgage > 126.25
|   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |--- Income > 116.50
|   |   |   |--- weights: [0.00, 222.00] class: 1
|   |--- Education_1 > 0.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [393.00, 0.00] class: 0
|   |   |--- Family > 2.50
|   |   |   |--- weights: [0.00, 46.00] class: 1
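To see which of these rules fire for an individual customer, the fitted tree's decision path can be traced. A small sketch (illustrative only, using the first row of the test set):
# Trace the root-to-leaf path of the unpruned tree for one customer
sample = X_test.iloc[[0]]
node_indicator = dTree.decision_path(sample)  # sparse matrix of visited nodes
print("Nodes visited:", node_indicator.indices)
print("Leaf reached:", dTree.apply(sample)[0], "- predicted class:", dTree.predict(sample)[0])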
# Importance of features in the tree building: the importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature (also known as Gini importance)
print(pd.DataFrame(dTree.feature_importances_, columns=["Imp"], index=X_train.columns).sort_values(by="Imp", ascending=False))
Imp Education_1 0.394614 Income 0.307906 Family 0.143453 CCAvg 0.060162 CD_Account_0 0.023071 ZIPCode 0.019500 Experience 0.015646 Age 0.015256 Mortgage 0.005881 Education_2 0.005561 CreditCard_1 0.002892 Securities_Account_0 0.002203 Online_0 0.002203 Online_1 0.001652 Securities_Account_1 0.000000 CD_Account_1 0.000000 CreditCard_0 0.000000 Education_3 0.000000
importances = dTree.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
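The importance-plot cell above is repeated verbatim for every model that follows; a small helper (a hypothetical name, not part of the original notebook) could remove that duplication:
def plot_feature_importances(model, feature_names, figsize=(12, 12)):
    # Sort features by importance and draw a horizontal bar chart
    importances = model.feature_importances_
    indices = np.argsort(importances)
    plt.figure(figsize=figsize)
    plt.title("Feature Importances")
    plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
    plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
    plt.xlabel("Relative Importance")
    plt.show()
# Usage would then be e.g.: plot_feature_importances(dTree, feature_names)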
dTree1 = DecisionTreeClassifier(criterion = 'gini',max_depth=3,random_state=1)
dTree1.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=3, random_state=1)
decision_make_confusion_matrix(dTree1, y_test)
# Accuracy on train and test
print("Accuracy on training set : ",dTree1.score(X_train, y_train))
print("Accuracy on test set : ",dTree1.score(X_test, y_test))
# Recall on train and test
decision_get_recall_score(dTree1)
Accuracy on training set : 0.9841178169217442 Accuracy on test set : 0.9811447811447811 Recall on training set : 0.844776119402985 Recall on test set : 0.8275862068965517
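Since roughly 90% of customers did not take the loan, accuracy has a high floor here: predicting "no loan" for everyone already scores about 0.90. A quick sanity check against the test labels:
# Majority-class baseline: predict 0 (no loan) for every customer
print("Baseline accuracy (all zeros):", (y_test == 0).mean())  # ~0.902, per the class split above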
plt.figure(figsize=(15,10))
tree.plot_tree(dTree1,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(dTree1,feature_names=feature_names,show_weights=True))
|--- Income <= 113.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2497.00, 0.00] class: 0
|   |   |--- Income > 106.50
|   |   |   |--- weights: [63.00, 8.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- CD_Account_1 <= 0.50
|   |   |   |--- weights: [161.00, 36.00] class: 0
|   |   |--- CD_Account_1 > 0.50
|   |   |   |--- weights: [3.00, 15.00] class: 1
|--- Income > 113.50
|   |--- Education_1 <= 0.50
|   |   |--- Income <= 116.50
|   |   |   |--- weights: [11.00, 8.00] class: 0
|   |   |--- Income > 116.50
|   |   |   |--- weights: [0.00, 222.00] class: 1
|   |--- Education_1 > 0.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [393.00, 0.00] class: 0
|   |   |--- Family > 2.50
|   |   |   |--- weights: [0.00, 46.00] class: 1
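A leaf rule such as Income > 116.5 with Education_1 = 0 can be checked directly against the training data; a small sketch (the counts should agree with the leaf weights shown in the tree above):
# Customers matching the 'income above 116.5, education above undergraduate' leaf
rule = (X_train["Income"] > 116.5) & (X_train["Education_1"] == 0)
print("Matching customers:", rule.sum())  # should be 222 per the leaf weights
print("Share who took the loan:", y_train[rule].mean())  # should be 1.0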
# Importance of features in the tree building: the importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature (also known as Gini importance)
print(pd.DataFrame(dTree1.feature_importances_, columns=["Imp"], index=X_train.columns).sort_values(by="Imp", ascending=False))
Imp Education_1 0.433683 Income 0.338340 Family 0.159032 CCAvg 0.041985 CD_Account_1 0.026960 Age 0.000000 Online_0 0.000000 Education_2 0.000000 CreditCard_1 0.000000 CreditCard_0 0.000000 Online_1 0.000000 CD_Account_0 0.000000 Experience 0.000000 Securities_Account_1 0.000000 Securities_Account_0 0.000000 Mortgage 0.000000 ZIPCode 0.000000 Education_3 0.000000
importances = dTree1.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(10,10))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
from sklearn.model_selection import GridSearchCV
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(1,10),
'min_samples_leaf': [1, 2, 5, 7, 10,15,20],
'max_leaf_nodes' : [2, 3, 5, 10],
'min_impurity_decrease': [0.001,0.01,0.1]
}
# Type of scoring used to compare parameter combinations - recall, since the goal is to identify as many potential loan buyers as possible
recall_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=10,
min_impurity_decrease=0.001, random_state=1)
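Besides the repr above, GridSearchCV also exposes the winning parameter combination and the cross-validated recall it achieved; for example:
# Best hyper-parameters found and their mean cross-validated recall
print("Best parameters:", grid_obj.best_params_)
print("Best mean CV recall:", grid_obj.best_score_)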
decision_make_confusion_matrix(estimator,y_test)
# Accuracy on train and test
print("Accuracy on training set : ",estimator.score(X_train, y_train))
print("Accuracy on test set : ",estimator.score(X_test, y_test))
# Recall on train and test
decision_get_recall_score(estimator)
Accuracy on training set : 0.9881605544325729 Accuracy on test set : 0.9838383838383838 Recall on training set : 0.9253731343283582 Recall on test set : 0.896551724137931
plt.figure(figsize=(15,10))
tree.plot_tree(estimator,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
# Importance of features in the tree building: the importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature (also known as Gini importance)
print(pd.DataFrame(estimator.feature_importances_, columns=["Imp"], index=X_train.columns).sort_values(by="Imp", ascending=False))
# Note that the importance of the top features has increased relative to the default tree, while all other features drop to zero
Imp Education_1 0.435655 Income 0.336063 Family 0.153584 CCAvg 0.048663 CD_Account_0 0.026036 Online_0 0.000000 Education_2 0.000000 CreditCard_1 0.000000 CreditCard_0 0.000000 Online_1 0.000000 Age 0.000000 CD_Account_1 0.000000 Experience 0.000000 Securities_Account_1 0.000000 Securities_Account_0 0.000000 Mortgage 0.000000 ZIPCode 0.000000 Education_3 0.000000
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Observations from the tree:
Using the above extracted decision rules we can make interpretations from the decision tree model, for example:
- Customers with an income above ~116.5 (thousand dollars) and an education level above undergraduate (Education_1 = 0) are very likely to take a personal loan.
- Customers with an income above ~113.5, an undergraduate education (Education_1 = 1), and a family of 3 or more are also very likely to take a personal loan.
Interpretations from other decision rules can be made similarly.
Next, we apply cost complexity (post-)pruning: scikit-learn computes a sequence of effective alphas, each of which yields a progressively smaller pruned subtree, and we then pick the subtree that performs best on the test set.
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000 | 0.000000 |
| 1 | 0.000270 | 0.000539 |
| 2 | 0.000270 | 0.001617 |
| 3 | 0.000281 | 0.002179 |
| 4 | 0.000284 | 0.002748 |
| 5 | 0.000334 | 0.003749 |
| 6 | 0.000347 | 0.004442 |
| 7 | 0.000351 | 0.005494 |
| 8 | 0.000385 | 0.005879 |
| 9 | 0.000432 | 0.006742 |
| 10 | 0.000433 | 0.007175 |
| 11 | 0.000442 | 0.009387 |
| 12 | 0.000449 | 0.010285 |
| 13 | 0.000480 | 0.011245 |
| 14 | 0.000505 | 0.011751 |
| 15 | 0.000510 | 0.014303 |
| 16 | 0.000520 | 0.014823 |
| 17 | 0.000520 | 0.015343 |
| 18 | 0.000536 | 0.015879 |
| 19 | 0.000753 | 0.018138 |
| 20 | 0.000798 | 0.018936 |
| 21 | 0.000969 | 0.019905 |
| 22 | 0.001257 | 0.021162 |
| 23 | 0.002277 | 0.025716 |
| 24 | 0.003388 | 0.029104 |
| 25 | 0.004032 | 0.033136 |
| 26 | 0.006279 | 0.039415 |
| 27 | 0.023783 | 0.063198 |
| 28 | 0.055780 | 0.174758 |
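Both columns grow together along the path: a larger effective alpha prunes more aggressively, which can only raise the total impurity of the remaining leaves. A quick check of this monotonicity (using the arrays extracted above):
# Effective alphas and total leaf impurities are both nondecreasing along the pruning path
print(bool(np.all(np.diff(ccp_alphas) >= 0)))  # True
print(bool(np.all(np.diff(impurities) >= 0)))  # True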
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
clf.fit(X_train, y_train)
clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]))
Number of nodes in the last tree is: 1 with ccp_alpha: 0.055779975764704705
# Remove the last alpha and its tree: that alpha prunes the tree down to a single root node
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1,figsize=(10,7))
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(10,5))
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(test_scores)
best_model = clfs[index_best_model]
print(best_model)
print('Training accuracy of best model: ',best_model.score(X_train, y_train))
print('Test accuracy of best model: ',best_model.score(X_test, y_test))
DecisionTreeClassifier(ccp_alpha=0.0007984340047719738, random_state=1) Training accuracy of best model: 0.9887380883626913 Test accuracy of best model: 0.9838383838383838
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    recall_train.append(metrics.recall_score(y_train, pred_train))
recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    recall_test.append(metrics.recall_score(y_test, pred_test))
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker='o', label="train",
drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
# selecting the model (alpha) that gives the highest recall on the test set
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0007984340047719738, random_state=1)
decision_make_confusion_matrix(best_model,y_test)
# Recall on train and test
decision_get_recall_score(best_model)
Recall on training set : 0.9313432835820895 Recall on test set : 0.896551724137931
plt.figure(figsize=(17,15))
tree.plot_tree(best_model,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
# Importance of features in the tree building: the importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature (also known as Gini importance)
print(pd.DataFrame(best_model.feature_importances_, columns=["Imp"], index=X_train.columns).sort_values(by="Imp", ascending=False))
Imp Education_1 0.432945 Income 0.333972 Family 0.152628 CCAvg 0.048360 CD_Account_0 0.025874 Age 0.006220 Mortgage 0.000000 Securities_Account_0 0.000000 Securities_Account_1 0.000000 Experience 0.000000 CD_Account_1 0.000000 Online_0 0.000000 Online_1 0.000000 CreditCard_0 0.000000 CreditCard_1 0.000000 ZIPCode 0.000000 Education_2 0.000000 Education_3 0.000000
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
The decision tree with post-pruning gives the best recall of all the trees built: about 0.93 on the training set and 0.90 on the test set. It matches the pre-pruned (GridSearchCV) tree on test recall while scoring slightly higher on training recall, so we select it as the final model.
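To close, a recap comparing the fitted trees side by side on test recall (a sketch reusing the models built above; the numbers should reproduce the outputs printed earlier):
# Compare test recall across the four trees built in this notebook
comparison = pd.DataFrame(
    {
        "Model": [
            "Default tree",
            "Pre-pruned (max_depth=3)",
            "Pre-pruned (GridSearchCV)",
            "Post-pruned (ccp_alpha)",
        ],
        "Test recall": [
            metrics.recall_score(y_test, dTree.predict(X_test)),
            metrics.recall_score(y_test, dTree1.predict(X_test)),
            metrics.recall_score(y_test, estimator.predict(X_test)),
            metrics.recall_score(y_test, best_model.predict(X_test)),
        ],
    }
)
print(comparison)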